Coding for DS and DM
R coding module

Lecture 3

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Simple math for data structures

  • Arithmetic operations on vectors and matrices are performed element-wise (vectorized operations).
  • If two vectors are of unequal length, the shorter one is recycled to match the longer one. Be careful: R warns only when the longer length is not a multiple of the shorter one; otherwise recycling is silent!
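
A minimal sketch of silent recycling, assuming two small illustrative vectors (the values are not from the lecture). Since 4 is a multiple of 2, R recycles the shorter vector without any warning:

```r
x <- c(100, 200, 300, 400)   # length 4
y <- c(1, 2)                 # length 2, and 4 is a multiple of 2
z <- x + y                   # y is recycled as 1, 2, 1, 2: no warning is issued
z
# [1] 101 202 301 402
```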

Simple math for data structures

  • For example, the following vectors x and y have different lengths, and their sum is computed by recycling values of the shorter vector.
x <- c(100, 200, 300, 400, 500)
y <- c(0, 1, 2, 3, 4, 5, 6, 7)
## vector + vector
x + y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 100 201 302 403 504 105 206 307

Simple math for data structures

  • Adding a scalar to a vector:
a <-  1
x + a
[1] 101 201 301 401 501
  • Multiplying a vector by a scalar:
b <-  2
x * b
[1]  200  400  600  800 1000

Simple math for data structures

  • Adding a scalar to a matrix:
M <- matrix(1:16, ncol = 4, nrow = 4, byrow = TRUE)
a <-  1
M + a
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    6    7    8    9
[3,]   10   11   12   13
[4,]   14   15   16   17
  • Multiplying a matrix by a scalar:
b <-  2
M * b
     [,1] [,2] [,3] [,4]
[1,]    2    4    6    8
[2,]   10   12   14   16
[3,]   18   20   22   24
[4,]   26   28   30   32

Simple math for data structures

  • Adding a vector to a matrix and multiplying a matrix by a vector, both element-wise (the vector is recycled down the columns):
M + x
Warning in M + x: longer object length is not a multiple of shorter object
length
     [,1] [,2] [,3] [,4]
[1,]  101  502  403  304
[2,]  205  106  507  408
[3,]  309  210  111  512
[4,]  413  314  215  116
M * x
Warning in M * x: longer object length is not a multiple of shorter object
length
     [,1] [,2] [,3] [,4]
[1,]  100 1000 1200 1200
[2,] 1000  600 3500 3200
[3,] 2700 2000 1100 6000
[4,] 5200 4200 3000 1600

Simple math for data structures

  • Summing two matrices:
N <-  M <- matrix(1:20*10, ncol = 5, nrow = 4, byrow = TRUE)
M + N
     [,1] [,2] [,3] [,4] [,5]
[1,]   20   40   60   80  100
[2,]  120  140  160  180  200
[3,]  220  240  260  280  300
[4,]  320  340  360  380  400

Simple math for data structures

  • Matrix product:
M %*% x
       [,1]
[1,]  55000
[2,] 130000
[3,] 205000
[4,] 280000
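  • Be careful with the * symbol on matrices: it multiplies element-wise, not as a matrix product. Here x is recycled silently, since 20 is a multiple of 5: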
M * x
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]  1000 10000 12000 12000 1e+04
[2,] 12000  7000 40000 36000 3e+04
[3,] 33000 24000 13000 70000 6e+04
[4,] 64000 51000 36000 19000 1e+05

Simple math for data structures

  • Inner (dot) product of vectors:
x <- c(100, 200, 300, 400, 500)
y <- c(1, 2, 3, 4, 5)
x %*% y # equivalent to t(x)%*%y
     [,1]
[1,] 5500
  • Outer product of vectors
x %o% y # equivalent to x%*%t(y)
     [,1] [,2] [,3] [,4] [,5]
[1,]  100  200  300  400  500
[2,]  200  400  600  800 1000
[3,]  300  600  900 1200 1500
[4,]  400  800 1200 1600 2000
[5,]  500 1000 1500 2000 2500
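
The equivalences stated in the comments above can be checked directly. This is a small sketch using only base R functions (all.equal(), outer(), sum()):

```r
x <- c(100, 200, 300, 400, 500)
y <- c(1, 2, 3, 4, 5)

# Inner product: x %*% y returns a 1 x 1 matrix, the same as t(x) %*% y
inner_same <- all.equal(x %*% y, t(x) %*% y)

# Outer product: x %o% y is shorthand for outer(x, y) and equals x %*% t(y)
outer_same <- all.equal(x %o% y, x %*% t(y))

# The inner product as a plain number: element-wise product, then sum
dot <- sum(x * y)
dot
# [1] 5500
```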

Control structures (1)

Control structures are constructs that determine which code is executed, and how many times, based on a condition.

  • If statement:
if(7==7){
  print('Seven is equal to seven! Unbelievable!')
}
[1] "Seven is equal to seven! Unbelievable!"
  • If-else statement:
x <- 7; y <- 5; z <- x + y
if (z == 13) {
  print("True")
} else {
  print("False")
}
[1] "False"
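  • Using the ifelse() function (print() runs as a side effect and ifelse() also returns the value, so the result is shown twice):
x <- 7
y <- 5
z <- x + y
ifelse(z == 13, print("True"), print("False"))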
[1] "False"
[1] "False"
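
To show the result only once, one option is to let ifelse() return the string and print it afterwards. A minimal sketch (the variable name res is only illustrative):

```r
x <- 7
y <- 5
z <- x + y

# ifelse() returns the selected value; no print() inside the call
res <- ifelse(z == 13, "True", "False")
print(res)
# [1] "False"
```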

Control structures (2)

  • If / else if / else statement:

The if-else combination is commonly used to test conditions and handle results depending on the evaluation.

x <- 5
y <- 5
if(x > y) {
  print("x is greater")
} else if(x < y) {
  print("y is greater")
} else {
  print("x and y are equal")
}
[1] "x and y are equal"

Control structures (3)

  • Nested if statement:
x <- 7
y <- 5
z <- 2
if(x > y) {
  print("x is greater than y")
  if(x > z) {
    print("x is greater than y and z")
  }
}
[1] "x is greater than y"
[1] "x is greater than y and z"

Control structures (4)

  • In R, the if statement is not vectorized: its condition must be a single TRUE or FALSE value.
  • If a vector with more than one element is passed as the condition, R (version 4.2.0 or later) throws an error; earlier versions used only the first element and issued a warning.

Example:

v <- c(1,2,3,4,5,6)

if (v %% 2 == 0) {
  print("even")
}

Error in if (v%%2 == 0) { : the condition has length > 1
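
When a single decision is needed from a whole vector, the condition can be collapsed to one logical value with any() or all(). A minimal sketch, assuming that is the intended question (the name is_even is illustrative):

```r
v <- c(1, 2, 3, 4, 5, 6)
is_even <- v %% 2 == 0   # logical vector, one value per element

# any() and all() reduce the vector to a single TRUE/FALSE, which if accepts
if (any(is_even)) {
  print("v contains at least one even number")
}
if (all(is_even)) {
  print("all elements of v are even")
} else {
  print("not all elements of v are even")
}
```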

Control structures (5)

The ifelse() function checks a condition for every element in a vector.

v <- c(1,2,3,4,5,6)
ifelse(v %% 2 == 0, "even", "odd")
[1] "odd"  "even" "odd"  "even" "odd"  "even"
  • You can also use ifelse to choose between two vectors.
v1 <- c(1,2,3,4,5,6)
v2 <- c("a","b","c","d","e","f")
ifelse(c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE), v1, v2)
[1] "1" "b" "3" "d" "5" "f"
  • Extra: the case_when function from the dplyr package enables the vectorized evaluation of multiple if and else if conditions.
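
A short sketch of case_when(), assuming the dplyr package is installed. The thresholds and labels are made up for illustration. Conditions are checked in order, and the final TRUE acts as the catch-all "else" branch:

```r
# Assumption: dplyr is installed (install.packages("dplyr") otherwise)
library(dplyr)

v <- c(1, 2, 3, 4, 5, 6)

# Each element gets the label of the first condition it satisfies
size <- case_when(
  v <= 2 ~ "small",
  v <= 4 ~ "medium",
  TRUE   ~ "large"
)
size
# [1] "small"  "small"  "medium" "medium" "large"  "large"
```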

Logical operators in control structures

  • Multiple conditions can be used with logical operators.
  • AND (&&), OR (||), and NOT (!) are used for these conditions.
x <- 7
y <- 5
z <- 2
if(x > y && x > z) {
  print("Yes, x is the greatest number")
}
[1] "Yes, x is the greatest number"
x <- 7
y <- 5
z <- 9
if(x > y || x > z) {
  print("x is greater than y or z")
}
[1] "x is greater than y or z"
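
Note that && and || work on single values and stop evaluating as soon as the result is known, whereas & and | operate element-wise on vectors. A minimal sketch with illustrative values:

```r
x <- c(7, 1, 9)
y <- 5
z <- 2

# Element-wise AND: one result per element of x
both <- x > y & x > z
both
# [1]  TRUE FALSE  TRUE

# Short-circuit AND: the right-hand side is skipped when the left is FALSE,
# so `a > 0` is never evaluated for a NULL value
a <- NULL
ok <- !is.null(a) && a > 0
ok
# [1] FALSE
```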

Still on the if clause

  • Compute the absolute value of x and assign it to y:
x <- -7
if(x < 0) {
  y <- (-x)
} else { 
  y <- x
}
y
[1] 7
  • Compare with the abs() function:
(y <- abs(x))
[1] 7

Other control clauses

The which() function returns the indices of the elements that meet a specific condition. Let us first construct a data frame:

v1 <- c(10, 20, 30)  ## numeric vector
v2 <- c('a', 'b', 'c')  ## character vector
v3 <- c(TRUE, TRUE, FALSE)  ## logical vector

my_data <- data.frame('c1' = v1, 
                      'c2' = v2, 
                      'c3' = v3, 
                      stringsAsFactors = FALSE)
my_data
  c1 c2    c3
1 10  a  TRUE
2 20  b  TRUE
3 30  c FALSE

Other control clauses

Get the row numbers where column c1 is greater than or equal to 20:

which(my_data$c1 >= 20)
[1] 2 3

which() also works with matrices and arrays in general; with arr.ind = TRUE it returns row and column indices:

which(M == max(M), arr.ind = TRUE) 
     row col
[1,]   4   5

Special conditional operators (1)

  • The %in% operator checks whether each element (e.g., a number) belongs to a vector (or to a column of a data frame).
## Sequences of Letters:
a <- LETTERS[1:10]
## Second seq of letters
b <- LETTERS[7:12]
## which elements of b (the shorter vector) appear in a (the longer one)?
b %in% a
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE
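
Because %in% returns a logical vector, it can be used directly for subsetting. A minimal sketch (the names common and only_b are illustrative):

```r
a <- LETTERS[1:10]   # "A" to "J"
b <- LETTERS[7:12]   # "G" to "L"

common <- b[b %in% a]      # elements of b that also appear in a
only_b <- b[!(b %in% a)]   # elements of b that do not appear in a

common
# [1] "G" "H" "I" "J"
only_b
# [1] "K" "L"
```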

The split function

  • The split() function takes a vector or another object and splits it into groups defined by a factor.
## Generates 5 values from a standard normal, 5 values from a
## uniform distribution (0,1), 5 values from a normal with mean 1 and sd 2:
x <- c(rnorm(n = 5), runif(n = 5), rnorm(n = 5, mean = 1,sd = 2))
f <- gl(3, 5) ## Generate levels (as.factor(rep(1:3, each=5)))
my_data <- split(x, f)
my_data
$`1`
[1] -0.4727550 -1.5070332  0.4366664 -0.1690388 -1.0510129

$`2`
[1] 0.57801036 0.17650213 0.05488572 0.75356149 0.98190205

$`3`
[1]  0.7238610  3.4132464  2.0042869 -1.4766671 -0.3637311
mean(my_data$`1`)
[1] -0.5526347
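
To summarize every group at once rather than one at a time, a function can be applied to each element of the list returned by split(). The sketch below uses base R sapply() on small deterministic data (the values are illustrative, chosen so the result is easy to check):

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)            # two groups of three observations each
groups <- split(x, f)    # list with elements `1` and `2`

# Apply mean() to each group; the result is a named numeric vector
group_means <- sapply(groups, mean)
group_means
```

Here the first group has mean 2 and the second has mean 20.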

The ‘for’ control structure (1)

  • For loops take an iterator variable and assign it successive values from a sequence or vector.
  • For loops are most commonly used for iterating over the elements of an object (list, vector, etc.) and doing some operations in the body of the loop.
for(i in 1:5){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

The ‘for’ control structure (2)

  • You can also use a vector containing a sequence of integers:
myvector <- 1:5 
for(i in myvector){
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5

The ‘for’ control structure (3)

  • You can also use seq() to generate a sequence:
  • With seq(), you can also iterate with some different step lengths:
myvector <- seq(from=1,to=10, by = 2)
for(i in myvector){
  print(i)
}
[1] 1
[1] 3
[1] 5
[1] 7
[1] 9

The ‘for’ control structure (4)

  • You can use range() within seq():
my_range <- range(1,10)
myvector <- seq(my_range[1], my_range[2], by = 2)
for(i in myvector){
  print(i)
}
[1] 1
[1] 3
[1] 5
[1] 7
[1] 9
  • For more complex loops, you can use primitives like seq_along()
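
A minimal sketch of seq_along(), which generates the index sequence 1, 2, ..., length(x) of an object. Unlike 1:length(x), it behaves correctly for empty objects. The vector names below are illustrative:

```r
fruits <- c("apple", "banana", "cherry")
for (i in seq_along(fruits)) {
  print(paste(i, fruits[i]))
}

# With an empty vector the loop body never runs
empty <- character(0)
n_iter <- 0
for (i in seq_along(empty)) {
  n_iter <- n_iter + 1
}
n_iter
# [1] 0

# By contrast, 1:length(empty) is 1:0, i.e. the two values 1 and 0
1:length(empty)
# [1] 1 0
```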

The ‘for’ control structure (5)

  • What can you do in the for body?
  • For example, compute a mean.
mymarks <- c(23,29,30,30,21,25,27,30,39,19)
len_mymarks <- length(mymarks)
my_sum <- 0
for(i in (1:len_mymarks)){
  my_sum <- my_sum + mymarks[i]
}
print(paste("The sum is: ", my_sum))
[1] "The sum is:  273"
print(paste("The mean is: " , my_sum/len_mymarks))
[1] "The mean is:  27.3"
  • Note: In R, it is generally recommended to avoid loops and instead opt for vectorized operations for better performance and efficiency.
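
For comparison, the same sum and mean can be obtained without a loop using the vectorized base R functions sum() and mean():

```r
mymarks <- c(23, 29, 30, 30, 21, 25, 27, 30, 39, 19)

my_sum  <- sum(mymarks)    # adds all elements in one call
my_mean <- mean(mymarks)   # equivalent to sum(mymarks) / length(mymarks)

my_sum
# [1] 273
my_mean
# [1] 27.3
```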

The ‘while’ control structure (1)

  • While loops begin by testing a condition.
  • If it is true, then it executes the loop body.
  • After the loop body is executed, the condition is re-evaluated. This process repeats until the condition becomes false, at which point the loop terminates.
val <- 1
while(val < 5) {
  val <- val + 1
  print(val)
}
[1] 2
[1] 3
[1] 4
[1] 5

The ‘while’ control structure (3)

  • Warning! While loops run forever if the condition never becomes false. Use with care! Below, val starts above 5 and only increases, so the loop never terminates:
val <- 6
while(val > 5) {
  val <- val + 1
  print(val)
}

The ‘break’ statement

  • The break statement allows you to exit a loop immediately when some condition is met. Here the loop is stopped after 100 iterations:
val <- 6
iter_max <- 0
while(val > 5) {
  val <- val + 1
  print(val)
  iter_max <- iter_max +1
  if(iter_max>=100){
    break
  }
}

The ‘next’ statement

  • The next statement enables you to skip the current iteration of a loop without terminating it.
  • Execution jumps straight to the next iteration: the next element in a for loop, or the re-evaluation of the condition in a while loop.
x <- 1:4
for (i in x) {
  if (i == 2) {
    next
  }
  print(i)
}
[1] 1
[1] 3
[1] 4

Loop functionals (1)

  • R has some for-loop replacements to make your life easier.
  • They are called functionals.
  • A functional is a function that takes another function as input and returns a vector (or a list) as output.
  • A detailed and very clear explanation is provided here
  • In what follows we will cover the use of base R lapply(), apply(), and tapply().
  • Enhanced versions of these commands are available in the purrr package.
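
To make the idea of a functional concrete, here is a hand-written sketch of what lapply() does conceptually. The name my_lapply is a hypothetical teaching helper, not part of base R, and the real lapply() is implemented differently (in C):

```r
# Hypothetical helper: applies FUN to each element of X and returns a list.
# Sketch only: it assumes FUN never returns NULL.
my_lapply <- function(X, FUN, ...) {
  out <- vector(mode = "list", length = length(X))
  for (i in seq_along(X)) {
    out[[i]] <- FUN(X[[i]], ...)   # extra arguments are forwarded to FUN
  }
  names(out) <- names(X)           # keep the names of the input, if any
  out
}

x <- list(a = 1:10, b = 1:100)
res <- my_lapply(x, mean)
res
```

The loop is written once inside the functional, so the caller only supplies the data and the function to apply.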

Loop functionals: lapply (2)

  • The lapply() functional applies a function to each element of a list, returning a list
  • Here is an example of applying the mean() function to all elements of a list. If the original list has names, then the names will be preserved in the output.
x <- list(a = 1:10, b = 1:100)
lapply(X = x, FUN = mean)
$a
[1] 5.5

$b
[1] 50.5

Loop functionals (3)

  • Functionals may be difficult to grasp at the beginning, because they are one step up in abstraction with respect to for loops:
lapply(X = 1:4, FUN = runif)
[[1]]
[1] 0.8193035

[[2]]
[1] 0.9685773 0.7217290

[[3]]
[1] 0.02957967 0.05483245 0.40914464

[[4]]
[1] 0.5048155 0.1010549 0.9509554 0.2531554
  • runif() generates random deviates from \(U(min,max)\) (with default \(min=0\) and \(max=1\) )

Loop functionals: explanation (4)

  • When you pass a function to lapply(), it takes elements of the list and passes them as the first argument of the function you are applying.
  • The first argument of runif() is n, and so the elements of the sequence 1:4 all got passed to the n argument
  • Functions that you pass to lapply() may have other arguments.
  • The runif() function has min and max arguments too.
  • Here is where the \(...\) (dot-dot-dot) argument to lapply() comes into play.

Loop functionals: several options (5)

  • You want to do the same operation as before, but now with min=0 and max=10
set.seed(33) # With for loops
res <- vector(mode = "list",length = 4)
for (i in 1:4) {
  res[[i]] <- runif(n = i,min = 0,max = 10)
}
res
[[1]]
[1] 4.459405

[[2]]
[1] 3.946503 4.837289

[[3]]
[1] 9.188760 8.438814 5.173496

[[4]]
[1] 4.3712500 3.4319822 0.1551696 1.1799116

Loop functionals: several options (6)

  • You want to do the same operation as before, but now with min=0 and max=10
set.seed(33) # With lapply using the dot-dot-dot argument
(res <- lapply(X = 1:4, FUN = runif, min=0,max=10))
[[1]]
[1] 4.459405

[[2]]
[1] 3.946503 4.837289

[[3]]
[1] 9.188760 8.438814 5.173496

[[4]]
[1] 4.3712500 3.4319822 0.1551696 1.1799116

Loop functionals: several options (7)

  • You want to do the same operation as before, but now with min=0 and max=10
set.seed(33) # With lapply explicitly defining FUN
(res <- lapply(
  X = 1:4,
  FUN = function(num)
    runif(n = num, min = 0, max = 10)
))
[[1]]
[1] 4.459405

[[2]]
[1] 3.946503 4.837289

[[3]]
[1] 9.188760 8.438814 5.173496

[[4]]
[1] 4.3712500 3.4319822 0.1551696 1.1799116

Loop functionals: sapply (8)

  • The sapply() function behaves similarly to lapply(), with the primary difference being in the returned value.
  • sapply() will try to simplify the result of lapply()
  • Essentially, sapply() calls lapply() on its input and then applies the following algorithm:
    • If the result is a list where every element is length 1, then a vector is returned;
    • If the result is a list where every element is a vector of the same length (> 1), a matrix is returned;
    • If it can’t figure things out, a list is returned.

Loop functionals: sapply (9)

  • Example returning a vector:
x <- list(a = 1:10, b = 1:100)
sapply(x, FUN = mean)
   a    b 
 5.5 50.5 

Loop functionals: sapply (10)

  • Example returning a matrix with the BOD (Biochemical Oxygen Demand) dataset from the datasets package:
BOD
  Time demand
1    1    8.3
2    2   10.3
3    3   19.0
4    4   16.0
5    5   15.6
6    7   19.8
sapply(BOD, function(x) 10 * x)
     Time demand
[1,]   10     83
[2,]   20    103
[3,]   30    190
[4,]   40    160
[5,]   50    156
[6,]   70    198
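
The remaining case of the simplification rules, where the results have different lengths and sapply() falls back to a list, can be sketched as follows (the anonymous functions are illustrative):

```r
# Results of length 1, 2 and 3 cannot be combined into a vector or matrix,
# so sapply() returns a list, exactly as lapply() would
res <- sapply(1:3, function(n) seq_len(n))
class(res)
# [1] "list"

# For contrast: every result has length 1, so a vector is returned
res_vec <- sapply(1:3, function(n) n^2)
res_vec
# [1] 1 4 9
```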